library(tidyverse)
library(caret)
library(rpart)
library(ggplot2)
library(GGally)
library(gridExtra)
library(plotly)
library(knitr)
library(kableExtra)
In this project, we will use data science to explore the inner workings of the Canadian labour market. By leveraging the rich information contained in the Labour Force Survey (LFS), a monthly survey conducted by Statistics Canada, we will uncover insights and trends that help us understand the current state of the labour market in Canada.
We will use R to clean and transform the LFS data, and then use exploratory data analysis to uncover patterns and trends hidden within the data. Our goal is to build a predictive model for our target variable, “NOC_10”, which records a worker’s occupation at their main job. This will allow us to classify and understand the data in a new way.
By the end of this project, we will have a greater understanding of the Canadian labor market and the ability to use data science to uncover valuable insights.
The Labour Force Survey (LFS) is a monthly survey by Statistics Canada which measures the current state of the Canadian labour market. The dataset used in this Project is released as a public use microdata file containing non-aggregated data for numerous variables from the October 2022 LFS.
# set working directory, if required
# setwd(choose.dir())
# import dataset
lfs <- read_csv('raw_data.csv')
The raw dataset contains 110,059 records and 60 variables. For this project, we will first focus on 12 of the variables and keep only the records with valid values (i.e. rows containing null values are removed). Table 1 below shows a summary of the 12 variables.
# Subset dataset and remove rows with null values
lfs <- lfs[,c(5,7,9:11,16:18,23,34,36,37)]
lfs <- na.omit(lfs)
# Show variable meanings
text_tbl <- data.frame(
Variables = c("AGE_12", "EDUC", "HRLYEARN", "IMMIG", "MARSTAT", "NAICS_21", "NOC_10", "PROV", "SEX", "TENURE", "UHRSMAIN", "UNION"),
Meanings = c("Age group of respondent (5-year increments)", "Highest educational attainment", "Hourly wages", "Immigrant status", "Marital status of respondent", "Industry (by North American Industry Classification System)", "Occupation (by National Occupation Classification)", "Province", "Sex of respondent", "Job tenure with current employer in months", "Usual hours worked per week at main job", "Union status")
)
kbl(text_tbl) %>%
kable_paper(full_width = F) %>%
column_spec(1, bold = T, border_right = T, background = "#fda172") %>%
column_spec(2, background = "oldlace")
| Variables | Meanings |
|---|---|
| AGE_12 | Age group of respondent (5-year increments) |
| EDUC | Highest educational attainment |
| HRLYEARN | Hourly wages |
| IMMIG | Immigrant status |
| MARSTAT | Marital status of respondent |
| NAICS_21 | Industry (by North American Industry Classification System) |
| NOC_10 | Occupation (by National Occupation Classification) |
| PROV | Province |
| SEX | Sex of respondent |
| TENURE | Job tenure with current employer in months |
| UHRSMAIN | Usual hours worked per week at main job |
| UNION | Union status |
The variable data types are stored as numeric or character in the original dataset. Only UHRSMAIN, TENURE and HRLYEARN are numerical variables; the remaining ones need to be converted to categorical (factor) data types. For ease of interpretation and reference, the variable levels are renamed.
Since the HRLYEARN variable is recorded in cents per hour in the original dataset, we will convert it to dollars per hour. A similar approach is taken for the UHRSMAIN variable, whose values are recalculated from ‘implied decimal’ to actual hours. Table 2 below shows the first 6 rows of the dataset, followed by a summary of the dataset that is ready for exploratory data analysis.
# Factorize variables
lfs <- lfs %>% mutate_at(c(1:8, 12), as.factor)
# Rename variable levels
lfs$PROV <- recode_factor(lfs$PROV,'10'='NL','11'='PEI','12'='NS','13'='NB','24'='QC',
'35'='ON','46'='MB','47'='SK','48'='AB','59'='BC')
lfs$AGE_12 <- recode_factor(lfs$AGE_12,
'01'='15-19yrs','02'='20-24yrs','03'='25-29yrs',
'04'='30-34yrs','05'='35-39yrs','06'='40-44yrs',
'07'='45-49yrs','08'='50-54yrs','09'='55-59yrs',
'10'='60-64yrs','11'='65-69yrs','12'='70andUp')
levels(lfs$SEX)<-list('Male'='1','Female'='2')
levels(lfs$MARSTAT)<-list('Married'='1', 'Common-law'='2', 'Widowed'='3',
'Separated'='4', 'Divorced'='5', 'Single'='6')
levels(lfs$EDUC)<-list('0-8yrs'='0', 'HS_Some'='1', 'HS_Grad'='2',
'PSE_Some'='3', 'PSE_Grad'='4', 'Bachelors'='5','>Bachelors'='6')
levels(lfs$IMMIG)<-list('Immigrant_New'='1', 'Immigrant>10yrs'='2', 'Non-Immigrant'='3')
levels(lfs$UNION)<-list('Union'='1', 'Union_nonMember'='2','Non-Union'='3')
levels(lfs$NOC_10)<-list('Management'='01','Business, finance, administration'='02',
'Natural & applied sciences'='03','Health'='04',
'Education, law and social, community & government'='05',
'Art, culture, recreation & sport'='06',
'Sales & service'='07','Trades, transport & equipment operators'='08',
'Natural resources & agriculture'='09','Manufacturing & utilities'='10')
levels(lfs$NAICS_21)<-list('Agriculture'='01',
'Forestry and logging and support activities for forestry'='02',
'Fishing, hunting and trapping'='03',
'Mining, quarrying, and oil and gas extraction'='04',
'Utilities'='05',
'Construction'='06',
'Manufacturing - durable goods'='07',
'Manufacturing - non-durable goods'='08',
'Wholesale trade'='09',
'Retail trade'='10',
'Transportation and warehousing' = '11',
'Finance and insurance' = '12',
'Real estate and rental and leasing' = '13',
'Professional, scientific and technical services' = '14',
'Business, building and other support services' = '15',
'Educational services' = '16',
'Health care and social assistance' = '17',
'Information, culture and recreation' = '18',
'Accommodation and food services' = '19',
'Other services (except public administration)' = '20',
'Public administration' = '21')
# Change units of HRLYEARN and UHRSMAIN
lfs <- lfs %>% mutate(HRLYEARN = HRLYEARN / 100, UHRSMAIN = UHRSMAIN / 10)
kbl(head(lfs)) %>%
kable_paper(full_width = T)
| PROV | AGE_12 | SEX | MARSTAT | EDUC | IMMIG | NAICS_21 | NOC_10 | UHRSMAIN | TENURE | HRLYEARN | UNION |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NS | 25-29yrs | Female | Common-law | PSE_Grad | Non-Immigrant | Retail trade | Sales & service | 35 | 20 | 21.15 | Non-Union |
| SK | 35-39yrs | Female | Separated | PSE_Grad | Non-Immigrant | Educational services | Sales & service | 40 | 25 | 20.14 | Union |
| QC | 40-44yrs | Female | Married | Bachelors | Immigrant_New | Health care and social assistance | Health | 36 | 22 | 26.71 | Non-Union |
| NL | 30-34yrs | Female | Married | Bachelors | Non-Immigrant | Business, building and other support services | Sales & service | 37 | 121 | 23.39 | Non-Union |
| QC | 60-64yrs | Male | Common-law | PSE_Grad | Non-Immigrant | Finance and insurance | Sales & service | 40 | 181 | 39.00 | Non-Union |
| QC | 25-29yrs | Female | Common-law | PSE_Grad | Non-Immigrant | Health care and social assistance | Education, law and social, community & government | 24 | 42 | 21.86 | Non-Union |
str(lfs)
## tibble [56,302 x 12] (S3: tbl_df/tbl/data.frame)
## $ PROV : Factor w/ 10 levels "NL","PEI","NS",..: 3 8 5 1 5 5 6 10 9 9 ...
## $ AGE_12 : Factor w/ 12 levels "15-19yrs","20-24yrs",..: 3 5 6 4 10 3 7 1 8 7 ...
## $ SEX : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 1 2 1 2 ...
## $ MARSTAT : Factor w/ 6 levels "Married","Common-law",..: 2 4 1 1 2 2 5 6 1 1 ...
## $ EDUC : Factor w/ 7 levels "0-8yrs","HS_Some",..: 5 5 6 6 5 5 6 2 7 6 ...
## $ IMMIG : Factor w/ 3 levels "Immigrant_New",..: 3 3 1 3 3 3 2 3 3 2 ...
## $ NAICS_21: Factor w/ 21 levels "Agriculture",..: 10 16 17 15 12 17 14 19 14 10 ...
## $ NOC_10 : Factor w/ 10 levels "Management","Business, finance, administration",..: 7 7 4 7 7 5 3 7 1 7 ...
## $ UHRSMAIN: num [1:56302] 35 40 36 37 40 24 40 18 40 40 ...
## $ TENURE : num [1:56302] 20 25 22 121 181 42 49 4 240 75 ...
## $ HRLYEARN: num [1:56302] 21.1 20.1 26.7 23.4 39 ...
## $ UNION : Factor w/ 3 levels "Union","Union_nonMember",..: 3 1 3 3 3 3 3 3 3 3 ...
## - attr(*, "na.action")= 'omit' Named int [1:53757] 2 3 4 5 6 9 10 11 17 18 ...
## ..- attr(*, "names")= chr [1:53757] "2" "3" "4" "5" ...
Among the 12 variables within the dataset, we are interested in
predicting occupation type based on other variables. In the following
steps, we will explore relationships between NOC_10 (our
target variable) and other variables, determine if the target variable
is appropriate, and identify variables that may be significant in the
subsequent predictive modelling.
The density plot below shows the distribution of hourly earnings among workers in the Canadian labour market. The curve peaks at around $18-25 per hour, and the dashed blue line marks the mean hourly wage.
HRLYEARN

ggplot(lfs, aes(x = HRLYEARN)) +
geom_density(fill = "#00A6ED") +
xlab("Earnings ($/Hour)") +
ylab("Density") +
ggtitle("Hourly Earnings Data Distribution") +
scale_color_manual(values = c("#00A6ED", "#FF5B5B")) +
theme_minimal() +
theme(plot.title = element_text(size = 20, color = "#FF5B5B"),
axis.title = element_text(size = 14, color = "#FF5B5B"),
legend.title = element_text(size = 14, color = "#FF5B5B"),
legend.text = element_text(size = 12, color = "#FF5B5B")) +
labs(subtitle = "Data from the October 2022 Labour Force Survey") +
theme(panel.grid = element_line(color = "#FF5B5B", size = 0.5))+
geom_vline(aes(xintercept=mean(HRLYEARN)), color="blue", linetype="dashed", size=1)
Since a small number of workers earn very high hourly wages, we remove outliers using the 1.5 IQR rule: any data point more than 1.5*IQR above the third quartile or more than 1.5*IQR below the first quartile is removed from the dataset. This trims the long right tail of the distribution and allows further evaluation to concentrate on the majority of workers.
# Remove HRLYEARN outliers with the 1.5*IQR rule
# (20 and 40 approximate Q1 and Q3 of HRLYEARN)
iqr <- 40 - 20
lower_boundary <- 20 - 1.5 * iqr
upper_boundary <- 40 + 1.5 * iqr
lfs <- lfs %>% filter(HRLYEARN <= upper_boundary & HRLYEARN >= lower_boundary)
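The boundaries above use fixed quartile values (20 and 40); a more general sketch, assuming the same 1.5*IQR rule, computes them from the data itself with quantile(). The vector x here is a toy stand-in for lfs$HRLYEARN:

```r
# 1.5*IQR boundaries computed from the data rather than hard-coded
# (x is a toy vector standing in for lfs$HRLYEARN)
x <- c(10, 15, 18, 20, 22, 25, 30, 35, 40, 200)

q <- quantile(x, c(0.25, 0.75))           # Q1 and Q3
iqr <- unname(q[2] - q[1])
lower <- unname(q[1] - 1.5 * iqr)
upper <- unname(q[2] + 1.5 * iqr)

x_trimmed <- x[x >= lower & x <= upper]   # the extreme value 200 is dropped
```

On the real data the same three lines would replace the hard-coded 20 and 40, so the rule stays correct if the wage distribution shifts.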
HRLYEARN regraphed after removing outliers

# Density plot after removing outliers
ggplot(lfs, aes(x = HRLYEARN)) +
geom_density(fill = "#00A6ED") +
xlab("Earnings ($/Hour)") +
ylab("Density") +
ggtitle("Hourly Earnings Data Distribution") +
scale_color_manual(values = c("#00A6ED", "#FF5B5B")) +
theme_minimal() +
theme(plot.title = element_text(size = 20, color = "#FF5B5B"),
axis.title = element_text(size = 14, color = "#FF5B5B"),
legend.title = element_text(size = 14, color = "#FF5B5B"),
legend.text = element_text(size = 12, color = "#FF5B5B")) +
labs(subtitle = "Data from the October 2022 Labour Force Survey") +
theme(panel.grid = element_line(color = "#FF5B5B", size = 0.5))+
geom_vline(aes(xintercept=mean(HRLYEARN)), color="blue", linetype="dashed", size=1)
The histogram in Figure 3 below shows that the majority of weekly hours worked fall between 38 and 40, with a peak at around 40 hours. The mean of approximately 35 sits to the left of the peak, meaning the distribution is slightly left-skewed. The plot also indicates relatively few observations above 40 hours per week, and a minimal number below 30.
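The mean-below-the-peak reading can be checked numerically: in a left-skewed sample the mean is pulled below the median. A small sketch with simulated hours (an illustration, not the LFS data):

```r
# Toy left-skewed sample: a large mass near 40 hours plus a tail of shorter weeks
set.seed(2)
hours <- c(rnorm(900, mean = 40, sd = 2), runif(100, min = 10, max = 30))

# The left tail drags the mean down while the median stays near the 40-hour bulk
mean(hours) < median(hours)
```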
UHRSMAIN

ggplot(lfs, aes(x = UHRSMAIN)) +
geom_histogram(fill = "#00A6ED", color = "#0072C6", binwidth = 2) +
xlab("Hours/week") +
ylab("Frequency") +
ggtitle("Distribution of Weekly Work Hours") +
scale_color_manual(values = c("#00A6ED", "#FF5B5B")) +
theme_minimal() +
theme(plot.title = element_text(size = 20, color = "#FF5B5B"),
axis.title = element_text(size = 14, color = "#FF5B5B"),
legend.title = element_text(size = 14, color = "#FF5B5B"),
legend.text = element_text(size = 12, color = "#FF5B5B")) +
labs(subtitle = "Data from the October 2022 Labour Force Survey") +
theme(panel.grid = element_line(color = "#FF5B5B", size = 0.5))+
geom_vline(aes(xintercept=mean(UHRSMAIN)), color="blue", linetype="dashed", size=1)
The density plot in Figure 4 below shows the distribution of job tenure for individuals employed by a company or organization. The majority of individuals have been with their employer for between 0 and 5 years, with an average tenure of roughly 18 months within that range. The plot then shows a decline in density for longer tenures. TENURE is capped at 240 months (20 years), which is why there is a second peak at 240. This information is useful for understanding overall employee tenure within each occupation.
TENURE

ggplot(lfs, aes(x = TENURE)) +
geom_density(fill = "#00A6ED") +
xlab("Months") +
ylab("Density") +
ggtitle("Tenure in Months") +
scale_color_manual(values = c("#00A6ED", "#FF5B5B")) +
theme_minimal() +
theme(plot.title = element_text(size = 20, color = "#FF5B5B"),
axis.title = element_text(size = 14, color = "#FF5B5B"),
legend.title = element_text(size = 14, color = "#FF5B5B"),
legend.text = element_text(size = 12, color = "#FF5B5B")) +
scale_fill_gradient(low = "#00A6ED", high = "#0072C6") +
labs(subtitle = "Data from the October 2022 Labour Force Survey") +
theme(panel.grid = element_line(color = "#FF5B5B", size = 0.5))+
geom_vline(aes(xintercept=mean(TENURE)), color="blue", linetype="dashed", size=1)
UHRSMAIN vs. HRLYEARN

lfs_scatter <- lfs %>%
ggplot(aes(HRLYEARN,UHRSMAIN)) +
geom_point(alpha=0.2, size=1.5, color="navy") +
labs(y="Hours Worked Per Week", x="Hourly Earnings ($)", title="Hours Worked vs. Hourly Earnings")+
geom_smooth(method='lm', color = 'red')
lfs_scatter
The scatterplot shows hourly earnings ($) on the x-axis and hours worked per week on the y-axis. The red linear regression line has a slope close to 0, indicating minimal correlation between the two variables: hourly earnings have little effect on the number of hours worked per week. Both variables could nevertheless be useful predictors in our model, since they carry largely independent information.
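The flat regression line can be quantified with cor() and lm(); a minimal sketch on independently generated toy data (on the real data the calls would use lfs$HRLYEARN and lfs$UHRSMAIN):

```r
# Pearson correlation and regression slope for two weakly related variables
# (toy vectors standing in for HRLYEARN and UHRSMAIN)
set.seed(1)
earnings <- runif(500, min = 15, max = 55)    # toy hourly wages
hours    <- rnorm(500, mean = 37, sd = 5)     # toy weekly hours, generated independently

r     <- cor(earnings, hours)                 # near 0 for unrelated variables
slope <- coef(lm(hours ~ earnings))[2]        # slope of the geom_smooth(method = 'lm') line
```

A correlation near zero and a near-flat slope together support the "minimal effect" reading of the scatterplot.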
NOC_10 and NAICS_21

bar <- lfs %>%
ggplot(aes(y=NOC_10, fill=NAICS_21))+
geom_bar(position=position_stack(reverse=TRUE), color='grey',width = .5)+
ggtitle("Data Distribution of Occupation and Industry")+
ylab ("Occupation")+
xlab("Count by Industry")+
scale_y_discrete(labels=function(y) str_wrap(y,width=20))+
theme(plot.title = element_text(size=14,face="bold"),
axis.title=element_text(size=12,face="bold"))
ggplotly(bar)
As shown in Figure 6 above, the survey records are not well distributed across occupation types: there are more than 10 times as many survey subjects working in sales and service occupations as in art, culture, recreation and sport.
There also appear to be patterns between occupation and industry. For example, 4726 of the 4765 people with health occupations work in the health care and social assistance industry, and most people working in the retail trade industry have sales and service occupations.
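Class counts like these can be tabulated directly with dplyr::count(); a toy sketch below (on the real data this would be lfs %>% count(NOC_10, sort = TRUE)):

```r
library(dplyr)

# Toy occupation factor illustrating an imbalanced class distribution
toy <- tibble(NOC_10 = factor(c(rep("Sales & service", 8),
                                rep("Health", 3),
                                "Art, culture, recreation & sport")))

counts <- toy %>% count(NOC_10, sort = TRUE)   # largest class listed first
```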
NOC_10 and UNION Comparison

barchart_union <- ggplot(lfs, aes(x = NOC_10, fill = UNION)) +
geom_bar(aes(y = (..count..) / sum(..count..)), position = "fill") +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels=function(x) str_wrap(x,width=20))+
ylab("Percentage") + xlab("Occupation") +
ggtitle("Occupation and Union Comparison") + coord_flip() +
theme(plot.title = element_text(size=14,face="bold"),
axis.title=element_text(size=10,face="bold"))
ggplotly(barchart_union)
The bar chart in Figure 7 above shows the percentage of union members in each occupation. There is a higher percentage of union members in “Health occupations” and “Occupations in education, law and social, community and government services”. This separability could benefit our predictive model.
The boxplot in Figure 8 below shows the hourly earnings for each occupation, split by sex. We can see that “management occupations” earn the most, followed by “natural and applied sciences and related occupations” and “health occupations”. We also see that men earn, on average, more than women. Both factors add separability to the predictive model.
HRLYEARN per NOC_10 Grouped by SEX

boxplot_HRLYEARN <- ggplot(lfs, aes(x=NOC_10, y=HRLYEARN, fill=SEX)) +
geom_boxplot(width = .8) + ylab("Hourly Earnings ($)") + xlab('Occupation') +
labs(title = "Hourly Earnings per Occupation Type Divided By Sex") +
scale_x_discrete(labels=function(x) str_wrap(x,width=20))+
scale_fill_manual(values = c("#8ECEFD", "#F88B9D")) + coord_flip()+
guides(fill = guide_legend(reverse = TRUE))+
scale_y_continuous(n.breaks = 7) +
theme(plot.title = element_text(size=14,face="bold"),
axis.title=element_text(size=12,face="bold"))
boxplot_HRLYEARN
The heatmap below (Figure 9) allows us to visualize our data in the form of hot and cold spots using a warm-to-cool color scheme. The areas in floral white indicate an area of low frequency and the areas in dark orange indicate an area of high frequency. We can see that our dataset contains a high frequency of post-secondary graduates and a high frequency of people working in sales and service. In addition, a high percentage of high school graduates work in sales and service, which can be a key predictor for the classification model.
NOC_10 and EDUC Comparison

heatmap_edu <- lfs %>%
select(NOC_10, EDUC) %>%
xtabs(~., data = .) %>%
data.frame() %>%
ggplot(aes(NOC_10, EDUC, fill = Freq)) +
geom_tile(col = "white")+
labs(title = "Occupation and Education Comparison",
x = "Occupation", y = "Education") +
scale_x_discrete(labels=function(x) str_wrap(x,width=20))+
scale_fill_gradient(low = "floralwhite", high = "darkorange3") +
coord_flip() + theme(plot.title = element_text(size=14,face="bold"),
axis.title=element_text(size=12,face="bold"),
axis.text.x = element_text(angle=-45, hjust=-0.1))
options(repr.plot.width = 40, repr.plot.height = 8)
heatmap_edu
Note that since the number of records varies widely across occupations, the ‘heated’ areas in the heatmap are biased towards occupations with more records. For example, the number of records for occupations in art, culture, recreation and sport is much lower than for the other occupations, which leads to uniformly low frequencies in that row of the heatmap.
NOC_10 and PROV Comparison

heatmap_prov <- lfs %>%
select(NOC_10, PROV) %>%
xtabs(~., data = .) %>%
data.frame() %>%
ggplot(aes(NOC_10, PROV, fill = Freq)) +
geom_tile(col = "white")+
labs(title = "Occupation and Province Comparison",
x = "Occupation", y = "Province") +
scale_x_discrete(labels=function(x) str_wrap(x,width=20))+
scale_y_discrete(limits=rev)+
scale_fill_gradient(low = "floralwhite", high = "darkorange3") +
coord_flip() + theme(plot.title = element_text(size=14,face="bold"),
axis.title=element_text(size=12,face="bold"))
options(repr.plot.width = 40, repr.plot.height = 8)
heatmap_prov
As seen in Figure 10, there is a higher frequency of records in Quebec and Ontario. However, there is no prominent distinction in the mix of occupations among provinces; the frequency variation is mainly due to the uneven distribution of records.
In this section, we will explore various classification algorithms to predict the target variable, Occupation (NOC_10), based on the other variables. As noted in Figure 6, the survey records are not well distributed across occupation types. In addition, over 50,000 records would take significant compute time for predictive modelling. For the purpose of this demonstration, we will reduce the total records by randomly sampling survey records in two ways:

Method 1: a simple random sample of 10,000 records.
Method 2: a stratified sample of 1,000 records from each of the 10 occupation types.

Method 1 aims to maintain the distribution of the dataset, but may lose training data for occupation types with few records. The opposite applies to Method 2: each occupation type gets an equal amount of training data, but the true data distribution is not reflected.
NOC_10 (Original vs Resample Method 1)

# Data Sampling Method 1
set.seed(10)
lfs_sample1 <- lfs %>% sample_n(10000) %>% select(-c('PROV'))
lfs_sample1 <- as.data.frame(lfs_sample1)
ggplot()+
geom_bar(data = lfs, aes(x=NOC_10, fill = "Original"))+
geom_bar(data = lfs_sample1, aes(x=NOC_10, fill = "Sample"))+
ggtitle("Data Distribution by Occupation")+
xlab ("")+
ylab("Count")+
scale_x_discrete(labels=function(x) str_wrap(x,width=20))+
theme(plot.title = element_text(size=14,face="bold"),
axis.title=element_text(size=12,face="bold"),
axis.text.x = element_text(angle=-45, hjust=-0.1),
legend.title = element_blank())
# Data Sampling Method 2
set.seed(10)
df1 <- lfs %>%
filter(NOC_10 == 'Management') %>%
sample_n(1000)
df2 <- lfs %>%
filter(NOC_10 == 'Business, finance, administration') %>%
sample_n(1000)
df3 <- lfs %>%
filter(NOC_10 == 'Natural & applied sciences') %>%
sample_n(1000)
df4 <- lfs %>%
filter(NOC_10 == 'Health') %>%
sample_n(1000)
df5 <- lfs %>%
filter(NOC_10 == 'Education, law and social, community & government') %>%
sample_n(1000)
df6 <- lfs %>%
filter(NOC_10 == 'Art, culture, recreation & sport') %>%
sample_n(1000)
df7 <- lfs %>%
filter(NOC_10 == 'Sales & service') %>%
sample_n(1000)
df8 <- lfs %>%
filter(NOC_10 == 'Trades, transport & equipment operators') %>%
sample_n(1000)
df9 <- lfs %>%
filter(NOC_10 == 'Natural resources & agriculture') %>%
sample_n(1000)
df10 <- lfs %>%
filter(NOC_10 == 'Manufacturing & utilities') %>%
sample_n(1000)
lfs_sample2 <- bind_rows(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, .id = NULL)
lfs_sample2 <- lfs_sample2 %>% select(-c('PROV')) %>% as.data.frame()
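The ten filter()/sample_n() calls above can be collapsed into a single grouped sample; a sketch on toy data (on the real data this would be lfs %>% group_by(NOC_10) %>% sample_n(1000) %>% ungroup()):

```r
library(dplyr)

# Stratified sampling in one step: group_by() + sample_n() draws the same
# number of rows from every class, matching the per-occupation filters above
set.seed(10)
toy <- tibble(NOC_10 = factor(rep(c("A", "B", "C"), times = c(60, 30, 10))))

toy_stratified <- toy %>%
  group_by(NOC_10) %>%
  sample_n(10) %>%       # 10 rows per class regardless of class size
  ungroup()
```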
Prior to running the predictive classification models, the dataset is
first separated into training and validation sets. 80% of the records
are partitioned as training set based on the target variable
(NOC_10), and the remaining 20% as validation set. Table 4
below shows a summary of record counts in each data set.
# Separate dataset into training and validation sets
set.seed(9)
inTraining_1 <- createDataPartition(lfs_sample1$NOC_10,p=0.8,list=FALSE)
training_1 <- lfs_sample1[inTraining_1,]
validation_1 <- lfs_sample1[-inTraining_1,]
inTraining_2 <- createDataPartition(lfs_sample2$NOC_10,p=0.8,list=FALSE)
training_2 <- lfs_sample2[inTraining_2,]
validation_2 <- lfs_sample2[-inTraining_2,]
# get counts for NOC_10 distribution in training and validation datasets
training_count1 <- tapply(training_1$NOC_10,training_1$NOC_10,length)
training_count2 <- tapply(training_2$NOC_10,training_2$NOC_10,length)
validation_count1 <- tapply(validation_1$NOC_10,validation_1$NOC_10,length)
validation_count2 <- tapply(validation_2$NOC_10,validation_2$NOC_10,length)
count <- tibble(value = as.vector(training_count1))
count$Occupation <- levels(lfs$NOC_10)
count$Validation_Method1 <- validation_count1
count$Training_Method2 <- training_count2
count$Validation_Method2 <- validation_count2
count %>% rename('Training_Method1'='value') %>%
select('Occupation', 'Training_Method1', 'Validation_Method1',
'Training_Method2', 'Validation_Method2')
Prior to running the models, the training parameters are set to use 10-fold cross-validation. Through exploratory data analysis, it was determined that there is minimal relationship between PROV and our target variable. Although some relationships exist between NOC_10 and the other variables, they were inconclusive and worth exploring in the predictive models. All variables except PROV are used to train and predict Occupation. Below is the list of variables used in training the models.
# Set training algorithm parameters
control <- trainControl(method = 'cv',number=10)
metric <- 'Accuracy'
# Display variables used in the model
colnames(lfs_sample1)
## [1] "AGE_12" "SEX" "MARSTAT" "EDUC" "IMMIG" "NAICS_21"
## [7] "NOC_10" "UHRSMAIN" "TENURE" "HRLYEARN" "UNION"
Five preliminary machine learning algorithms were run: Linear Discriminant Analysis (LDA), k-Nearest Neighbours (KNN), Classification and Regression Trees (CART), Support Vector Machines with a radial kernel (SVM) and Random Forest (RF).
All algorithms ran successfully in R; however, the Random Forest produced 100% accuracy in the corresponding confusion matrices, which is implausible and suggests overfitting or data leakage. Therefore only the mean cross-validation accuracy of the RF model will be evaluated and compared against the other models.
# LDA
# Models are fit on the training partitions only; the validation sets are held out
# Method 1
set.seed(5)
fit.lda1 <- train(NOC_10~., data=training_1, method='lda', metric=metric, trControl=control)
predictions_lda1 <- predict(fit.lda1, validation_1)
cm_lda1 <- confusionMatrix(predictions_lda1, as.factor(validation_1$NOC_10))
# Method 2
set.seed(5)
fit.lda2 <- train(NOC_10~., data=training_2, method='lda', metric=metric, trControl=control)
predictions_lda2 <- predict(fit.lda2, validation_2)
cm_lda2 <- confusionMatrix(predictions_lda2, as.factor(validation_2$NOC_10))
# KNN
# Method 1
set.seed(5)
fit.knn_1 <- train(NOC_10~., data=training_1, method='knn', metric=metric, trControl=control)
predictions_knn_1 <- predict(fit.knn_1, validation_1)
cm_knn_1 <- confusionMatrix(predictions_knn_1, as.factor(validation_1$NOC_10))
# Method 2
set.seed(5)
fit.knn_2 <- train(NOC_10~., data=training_2, method='knn', metric=metric, trControl=control)
predictions_knn_2 <- predict(fit.knn_2, validation_2)
cm_knn_2 <- confusionMatrix(predictions_knn_2, as.factor(validation_2$NOC_10))
# CART
# Method 1
set.seed(5)
fit.cart1 <- train(NOC_10~., data=training_1, method='rpart', metric=metric, trControl=control)
predictions_cart1 <- predict(fit.cart1, validation_1)
cm_cart_1 <- confusionMatrix(predictions_cart1, as.factor(validation_1$NOC_10))
# Method 2
set.seed(5)
fit.cart_2 <- train(NOC_10~., data=training_2, method='rpart', metric=metric, trControl=control)
predictions_cart_2 <- predict(fit.cart_2, validation_2)
cm_cart_2 <- confusionMatrix(predictions_cart_2, as.factor(validation_2$NOC_10))
# SVM
# Method 1
start_time <- Sys.time()
set.seed(5)
fit.svm1 <- train(NOC_10~., data=training_1, method='svmRadial', metric=metric, trControl=control)
predictions_svm1 <- predict(fit.svm1, validation_1)
end_time <- Sys.time()
cm_svm_1 <- confusionMatrix(predictions_svm1, as.factor(validation_1$NOC_10))
run_time <- end_time - start_time  # wall-clock time for training and prediction
# Method 2
set.seed(5)
fit.svm_2 <- train(NOC_10~., data=training_2, method='svmRadial', metric=metric, trControl=control)
predictions_svm_2 <- predict(fit.svm_2, validation_2)
cm_svm_2 <- confusionMatrix(predictions_svm_2, as.factor(validation_2$NOC_10))
# RF
# Method 1
set.seed(5)
fit.rf_1 <- train(NOC_10~., data=training_1, method='rf', metric=metric, trControl=control)
predictions_rf_1 <- predict(fit.rf_1, validation_1)
cm_rf_1 <- confusionMatrix(predictions_rf_1, as.factor(validation_1$NOC_10))
# Method 2
set.seed(5)
fit.rf_2 <- train(NOC_10~., data=training_2, method='rf', metric=metric, trControl=control)
predictions_rf_2 <- predict(fit.rf_2, validation_2)
cm_rf_2 <- confusionMatrix(predictions_rf_2, as.factor(validation_2$NOC_10))
As shown in Figure 12 below, the Random Forest with resample method 2 produced the most accurate predictive model, followed by the Support Vector Machine. However, there is no consistent result regarding the data sampling methods: training an equal number of records for each NOC_10 level did not consistently lead to more (or less) accurate model output. The model performances are further evaluated through the confusion matrices in the next section.
comparison <- resamples(list(lda1=fit.lda1, cart1=fit.cart1, knn1=fit.knn_1,
lda2=fit.lda2, cart2=fit.cart_2, knn2=fit.knn_2,
svm1=fit.svm1, svm2=fit.svm_2, rf1=fit.rf_1, rf2=fit.rf_2))
comparison <- as.data.frame(comparison)
comparison_tidy <- comparison %>%
pivot_longer(names_to = "Model", values_to = "Accuracy", -Resample) %>%
group_by(Model) %>%
summarise(Mean_Accuracy = mean(Accuracy))
mean_acc <- comparison_tidy %>%
ggplot(aes(x=fct_reorder(Model, Mean_Accuracy), y=Mean_Accuracy))+
geom_bar(stat = "identity")+
stat_summary(fun=mean, geom='text', color='white',hjust=1.2, aes(label=round(after_stat(y), digits=2)))+
coord_flip()+
xlab("Model")+
ylab("Mean Accuracy")+
theme(text = element_text(size = 12))
mean_acc
A confusion matrix is used as a performance measure for machine learning classification. Generally, the higher the number of correct classifications, the more accurate the model. In the tables below, green cells indicate correct classifications (i.e. the predicted value matches the actual value), and orange cells indicate incorrect classifications. The colour shades correspond to the magnitude of the counts, with darker shades indicating higher numbers. Grey cells indicate that no value was predicted in the corresponding class.
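Numerically, accuracy is the diagonal of such a table divided by its grand total; a minimal sketch with caret::confusionMatrix on toy labels:

```r
library(caret)

# Accuracy from a confusion matrix: correct (diagonal) counts over the total
actual    <- factor(c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C"))
predicted <- factor(c("A", "A", "B", "B", "B", "C", "C", "C", "A", "C"),
                    levels = levels(actual))

cm  <- confusionMatrix(predicted, actual)
acc <- sum(diag(cm$table)) / sum(cm$table)   # same value as cm$overall["Accuracy"]
```

Here 8 of the 10 toy predictions fall on the diagonal, so the accuracy is 0.8.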
# Confusion matrix Method 1
cm_lda1 <- as.data.frame(cm_lda1$table)
cm_lda1$diag <- cm_lda1$Prediction == cm_lda1$Reference # Get the Diagonal
cm_lda1$ndiag <- cm_lda1$Prediction != cm_lda1$Reference # Off Diagonal
cm_lda1[cm_lda1 == 0] <- NA # Replace 0 with NA for white tiles
cm_lda1$Reference <- fct_rev(cm_lda1$Reference) # diagonal starts at top left
cm_lda1$ref_freq <- cm_lda1$Freq * ifelse(is.na(cm_lda1$diag),-1,1)
plt_lda1 <- ggplot(data = cm_lda1, aes(x = Prediction , y = Reference, fill = Freq))+
geom_tile( data = cm_lda1,aes(fill = ref_freq)) +
scale_fill_gradient2(guide = FALSE ,low="red",high="#1bc46a",na.value = '#dee1e3') +
geom_text(aes(label = Freq), color = 'black', size = 5)+
scale_y_discrete(labels=function(y) str_wrap(y,width=20))+
scale_x_discrete(labels=function(x) str_wrap(x,width=20), position="top")+
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = "none",
panel.border = element_blank(),
plot.background = element_blank(),
axis.line = element_blank(),
axis.text.x = element_text(angle=45, hjust=-0.1)
)
options(repr.plot.width = 20, repr.plot.height = 8)
plt_lda1
This model is strongest at predicting the Sales & service occupation and weakest at predicting occupations in Art, culture, recreation & sport, as well as Management. This may be partly due to the distribution of training data, as there were many Sales & service records but very few art records in the training set.
# confusion matrix Method 2
cm_lda2 <- as.data.frame(cm_lda2$table)
cm_lda2$diag <- cm_lda2$Prediction == cm_lda2$Reference # Get the Diagonal
cm_lda2$ndiag <- cm_lda2$Prediction != cm_lda2$Reference # Off Diagonal
cm_lda2[cm_lda2 == 0] <- NA # Replace 0 with NA for white tiles
cm_lda2$Reference <- fct_rev(cm_lda2$Reference) # diagonal starts at top left
cm_lda2$ref_freq <- cm_lda2$Freq * ifelse(is.na(cm_lda2$diag),-1,1)
plt_lda2 <- ggplot(data = cm_lda2, aes(x = Prediction , y = Reference, fill = Freq))+
geom_tile( data = cm_lda2,aes(fill = ref_freq)) +
scale_fill_gradient2(guide = FALSE ,low="red",high="#1bc46a",na.value = '#dee1e3') +
geom_text(aes(label = Freq), color = 'black', size = 5)+
scale_y_discrete(labels=function(y) str_wrap(y,width=20))+
scale_x_discrete(labels=function(x) str_wrap(x,width=20), position="top")+
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = "none",
panel.border = element_blank(),
plot.background = element_blank(),
axis.line = element_blank(),
axis.text.x = element_text(angle=45, hjust=-0.1)
)
options(repr.plot.width = 12, repr.plot.height = 8)
plt_lda2
In comparison to resample method 1, the performance of this model is more consistent overall. The model correctly classified 196 of the 200 validation records in the Manufacturing & utilities occupation, but performed poorly on those in the Management occupation.
A similar difference between the two resample methods can be found in the KNN model, which relates to how the data was distributed and partitioned. As shown in Tables 7 and 8 below, KNN using resample method 1 performed strongest at classifying Sales & service, while resample method 2 is strongest at predicting Manufacturing & utilities.
cm_knn_1 <- as.data.frame(cm_knn_1$table)
cm_knn_1$diag <- cm_knn_1$Prediction == cm_knn_1$Reference # Get the Diagonal
cm_knn_1$ndiag <- cm_knn_1$Prediction != cm_knn_1$Reference # Off Diagonal
cm_knn_1[cm_knn_1 == 0] <- NA # Replace 0 with NA for white tiles
cm_knn_1$Reference <- fct_rev(cm_knn_1$Reference) # diagonal starts at top left
cm_knn_1$ref_freq <- cm_knn_1$Freq * ifelse(is.na(cm_knn_1$diag),-1,1)
plt_knn_1 <- ggplot(data = cm_knn_1, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile(aes(fill = ref_freq)) +
scale_fill_gradient2(guide = "none", low = "red", high = "#1bc46a", na.value = '#dee1e3') +
geom_text(aes(label = Freq), color = 'black', size = 5) +
scale_y_discrete(labels = function(y) str_wrap(y, width = 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20), position = "top") +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = "none",
panel.border = element_blank(),
plot.background = element_blank(),
axis.line = element_blank(),
axis.text.x = element_text(angle=45, hjust=-0.1)
)
#options(repr.plot.width = 12, repr.plot.height = 8)
plt_knn_1
# confusion matrix code for knn algorithm model
cm_knn_2 <- as.data.frame(cm_knn_2$table)
cm_knn_2$diag <- cm_knn_2$Prediction == cm_knn_2$Reference # Get the Diagonal
cm_knn_2$ndiag <- cm_knn_2$Prediction != cm_knn_2$Reference # Off Diagonal
cm_knn_2[cm_knn_2 == 0] <- NA # Replace 0 with NA for white tiles
cm_knn_2$Reference <- fct_rev(cm_knn_2$Reference) # diagonal starts at top left
cm_knn_2$ref_freq <- cm_knn_2$Freq * ifelse(is.na(cm_knn_2$diag),-1,1)
plt_knn_2 <- ggplot(data = cm_knn_2, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile(aes(fill = ref_freq)) +
scale_fill_gradient2(guide = "none", low = "red", high = "#1bc46a", na.value = '#dee1e3') +
geom_text(aes(label = Freq), color = 'black', size = 5) +
scale_y_discrete(labels = function(y) str_wrap(y, width = 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20), position = "top") +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = "none",
panel.border = element_blank(),
plot.background = element_blank(),
axis.line = element_blank(),
axis.text.x = element_text(angle=45, hjust=-0.1)
)
plt_knn_2
As shown in Tables 9 and 10 below, classification using the CART model performed poorly in general: both CART models failed to classify 7 of the 10 classes of the target variable. This is likely related to the algorithm itself and suggests that CART is not well suited to predicting occupation from our chosen variables and training parameters.
# confusion matrix code for CART algorithm model
cm_cart_1 <- as.data.frame(cm_cart_1$table)
cm_cart_1$diag <- cm_cart_1$Prediction == cm_cart_1$Reference # Get the Diagonal
cm_cart_1$ndiag <- cm_cart_1$Prediction != cm_cart_1$Reference # Off Diagonal
cm_cart_1[cm_cart_1 == 0] <- NA # Replace 0 with NA for white tiles
cm_cart_1$Reference <- fct_rev(cm_cart_1$Reference) # diagonal starts at top left
cm_cart_1$ref_freq <- cm_cart_1$Freq * ifelse(is.na(cm_cart_1$diag),-1,1)
plt_cart_1 <- ggplot(data = cm_cart_1, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile(aes(fill = ref_freq)) +
scale_fill_gradient2(guide = "none", low = "red", high = "#1bc46a", na.value = '#dee1e3') +
geom_text(aes(label = Freq), color = 'black', size = 5) +
scale_y_discrete(labels = function(y) str_wrap(y, width = 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20), position = "top") +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = "none",
panel.border = element_blank(),
plot.background = element_blank(),
axis.line = element_blank(),
axis.text.x = element_text(angle=45, hjust=-0.1)
)
#options(repr.plot.width = 12, repr.plot.height = 8)
plt_cart_1
cm_cart_2 <- as.data.frame(cm_cart_2$table)
cm_cart_2$diag <- cm_cart_2$Prediction == cm_cart_2$Reference # Get the Diagonal
cm_cart_2$ndiag <- cm_cart_2$Prediction != cm_cart_2$Reference # Off Diagonal
cm_cart_2[cm_cart_2 == 0] <- NA # Replace 0 with NA for white tiles
cm_cart_2$Reference <- fct_rev(cm_cart_2$Reference) # diagonal starts at top left
cm_cart_2$ref_freq <- cm_cart_2$Freq * ifelse(is.na(cm_cart_2$diag),-1,1)
plt_cart_2 <- ggplot(data = cm_cart_2, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile(aes(fill = ref_freq)) +
scale_fill_gradient2(guide = "none", low = "red", high = "#1bc46a", na.value = '#dee1e3') +
geom_text(aes(label = Freq), color = 'black', size = 5) +
scale_y_discrete(labels = function(y) str_wrap(y, width = 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20), position = "top") +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = "none",
panel.border = element_blank(),
plot.background = element_blank(),
axis.line = element_blank(),
axis.text.x = element_text(angle=45, hjust=-0.1)
)
#options(repr.plot.width = 12, repr.plot.height = 8)
plt_cart_2
The SVM models performed better overall than the other models. As seen in Table 12 below, at least 66 records were classified correctly in every class, while 195 records were correctly classified in the Manufacturing & utilities occupation.
cm_svm_1 <- as.data.frame(cm_svm_1$table)
cm_svm_1$diag <- cm_svm_1$Prediction == cm_svm_1$Reference # Get the Diagonal
cm_svm_1$ndiag <- cm_svm_1$Prediction != cm_svm_1$Reference # Off Diagonal
cm_svm_1[cm_svm_1 == 0] <- NA # Replace 0 with NA for white tiles
cm_svm_1$Reference <- fct_rev(cm_svm_1$Reference) # diagonal starts at top left
cm_svm_1$ref_freq <- cm_svm_1$Freq * ifelse(is.na(cm_svm_1$diag),-1,1)
plt_svm_1 <- ggplot(data = cm_svm_1, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile(aes(fill = ref_freq)) +
scale_fill_gradient2(guide = "none", low = "red", high = "#1bc46a", na.value = '#dee1e3') +
geom_text(aes(label = Freq), color = 'black', size = 5) +
scale_y_discrete(labels = function(y) str_wrap(y, width = 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20), position = "top") +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = "none",
panel.border = element_blank(),
plot.background = element_blank(),
axis.line = element_blank(),
axis.text.x = element_text(angle=45, hjust=-0.1)
)
#options(repr.plot.width = 12, repr.plot.height = 8)
plt_svm_1
cm_svm_2 <- as.data.frame(cm_svm_2$table)
cm_svm_2$diag <- cm_svm_2$Prediction == cm_svm_2$Reference # Get the Diagonal
cm_svm_2$ndiag <- cm_svm_2$Prediction != cm_svm_2$Reference # Off Diagonal
cm_svm_2[cm_svm_2 == 0] <- NA # Replace 0 with NA for white tiles
cm_svm_2$Reference <- fct_rev(cm_svm_2$Reference) # diagonal starts at top left
cm_svm_2$ref_freq <- cm_svm_2$Freq * ifelse(is.na(cm_svm_2$diag),-1,1)
plt_svm_2 <- ggplot(data = cm_svm_2, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile(aes(fill = ref_freq)) +
scale_fill_gradient2(guide = "none", low = "red", high = "#1bc46a", na.value = '#dee1e3') +
geom_text(aes(label = Freq), color = 'black', size = 5) +
scale_y_discrete(labels = function(y) str_wrap(y, width = 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20), position = "top") +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = "none",
panel.border = element_blank(),
plot.background = element_blank(),
axis.line = element_blank(),
axis.text.x = element_text(angle=45, hjust=-0.1)
)
plt_svm_2
The confusion matrices show that several factors affect the performance of a predictive model:
There is no conclusive result on classification performance and accuracy across the outputs of the predictive models. While some models are stronger at predicting certain classes, other models may show the opposite. One example is the Management occupation, where KNN with stratified samples performed well but LDA with stratified samples performed poorly.
How the training data is fed into the model can determine the classification metrics. When comparing confusion matrices for the same algorithm under different sampling methods, the number of correctly classified records sometimes corresponds to the class distribution in the training set.
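The per-class comparisons above can be read straight off a confusion matrix. A minimal base-R sketch with toy counts (not the actual model output) shows the arithmetic:

```r
# Toy 3-class confusion matrix: rows = Prediction, columns = Reference
cm <- matrix(c(150,  30,  20,
                25, 160,  15,
                25,  10, 165),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("Mgmt", "Sales", "Manuf"),
                             Reference  = c("Mgmt", "Sales", "Manuf")))

# Sensitivity (recall) per class: correct predictions / actual class total
recall <- diag(cm) / colSums(cm)

# Overall accuracy: diagonal sum / grand total
accuracy <- sum(diag(cm)) / sum(cm)

round(recall, 3)   # per-class recall: 0.75, 0.80, 0.825
round(accuracy, 3) # 0.792
```

Two models can share the same overall accuracy yet distribute their errors very differently across classes, which is exactly the pattern the confusion matrices above reveal. caret's confusionMatrix() reports these per-class statistics directly.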
Another aspect of classification modeling is determining whether any predictors are especially prominent in predicting the target variable, and which ones. Variable importance is calculated differently for a model-free algorithm (LDA, KNN, SVM) than for a model-based algorithm (CART, RF).
Based on Figure 13 below, Hourly Earnings (HRLYEARN) is the most prominent predictor in both the model-free algorithms and RF. Industry (NAICS_21) is a significant factor in the CART model as well as in the model-free algorithms. Immigration Status (IMMIG) is the least prominent predictor. In addition, numerical predictors seem to carry more weight than categorical predictors in the RF model.
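The "model-free" importance used for LDA, KNN, and SVM is a filter measure computed from each predictor on its own (caret's varImp falls back to ROC-curve analysis for such models). A simpler stand-in, a one-way ANOVA F statistic per numeric predictor, illustrates the idea on toy data with made-up values:

```r
set.seed(5)

# Toy data: one strongly separated predictor, one weakly separated predictor
toy <- data.frame(
  NOC_10   = rep(c("A", "B"), each = 50),
  HRLYEARN = c(rnorm(50, 25, 5),  rnorm(50, 45, 5)),   # strong class separation
  TENURE   = c(rnorm(50, 60, 30), rnorm(50, 65, 30))   # weak class separation
)

# Filter-style importance: how well does each predictor separate the classes
# on its own? Larger F statistic = more separability = more "important".
f_stat <- sapply(c("HRLYEARN", "TENURE"), function(v)
  summary(aov(toy[[v]] ~ toy$NOC_10))[[1]]$`F value`[1])
f_stat  # HRLYEARN dominates, mirroring the pattern in Figure 13
```

Because a filter measure ignores interactions between predictors, it can rank variables differently from a model-based measure such as RF's impurity-based importance, which is one reason the panels in Figure 13 disagree.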
# Variable importance
importance1 <- varImp(fit.lda1)
importance2 <- varImp(fit.cart1)
importance3 <- varImp(fit.knn_1)
importance4 <- varImp(fit.lda2)
importance5 <- varImp(fit.cart_2)
importance6 <- varImp(fit.knn_2)
importance7 <- varImp(fit.svm1)
importance8 <- varImp(fit.svm_2)
importance9 <- varImp(fit.rf_1)
importance10 <- varImp(fit.rf_2)
imp1 <- importance1$importance
imp2 <- importance2$importance
imp3 <- importance3$importance
imp4 <- importance4$importance
imp5 <- importance5$importance
imp6 <- importance6$importance
imp7 <- importance7$importance
imp8 <- importance8$importance
imp9 <- importance9$importance
imp10 <- importance10$importance
lda_imp1 <- imp1 %>%
mutate(Predictor = rownames(imp1)) %>%
pivot_longer(names_to = "NOC_10", values_to = "Importance", -Predictor) %>%
group_by(Predictor) %>%
summarise_at(vars(Importance), list(Importance = mean)) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank(),
text = element_text(size = 10))+
ylab("Model-Free (LDA,KNN,SVM) Method 1")+
xlab("")
cart_imp1 <- imp2 %>%
mutate(Predictor = rownames(imp2)) %>%
pivot_longer(names_to = "NOC_10", values_to = "Importance", -Predictor) %>%
arrange(desc(Importance)) %>%
top_n(6) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank(),
text = element_text(size = 10))+
ylab("CART Method 1")+
xlab("")
knn_imp1 <- imp3 %>%
mutate(Predictor = rownames(imp3)) %>%
pivot_longer(names_to = "NOC_10", values_to = "Importance", -Predictor) %>%
group_by(Predictor) %>%
summarise_at(vars(Importance), list(Importance = mean)) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank(),
text = element_text(size = 10))+
ylab("K-Nearest Neighbors Method 1")+
xlab("")
lda_imp2 <- imp4 %>%
mutate(Predictor = rownames(imp4)) %>%
pivot_longer(names_to = "NOC_10", values_to = "Importance", -Predictor) %>%
group_by(Predictor) %>%
summarise_at(vars(Importance), list(Importance = mean)) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank(),
text = element_text(size = 10))+
ylab("Model-Free (LDA,KNN,SVM) Method 2")+
xlab("")
cart_imp2 <- imp5 %>%
mutate(Predictor = rownames(imp5)) %>%
pivot_longer(names_to = "NOC_10", values_to = "Importance", -Predictor) %>%
arrange(desc(Importance)) %>%
top_n(6) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank(),
text = element_text(size = 10))+
ylab("CART Method 2")+
xlab("")
knn_imp2 <- imp6 %>%
mutate(Predictor = rownames(imp6)) %>%
pivot_longer(names_to = "NOC_10", values_to = "Importance", -Predictor) %>%
group_by(Predictor) %>%
summarise_at(vars(Importance), list(Importance = mean)) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank(),
text = element_text(size = 10))+
ylab("K-Nearest Neighbors Method 2")+
xlab("")
svm_imp1 <- imp7 %>%
mutate(Predictor = rownames(imp7)) %>%
pivot_longer(names_to = "NOC_10", values_to = "Importance", -Predictor) %>%
group_by(Predictor) %>%
summarise_at(vars(Importance), list(Importance = mean)) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank(),
text = element_text(size = 10))+
ylab("Support Vector Machine Method 1")+
xlab("")
svm_imp2 <- imp8 %>%
mutate(Predictor = rownames(imp8)) %>%
pivot_longer(names_to = "NOC_10", values_to = "Importance", -Predictor) %>%
group_by(Predictor) %>%
summarise_at(vars(Importance), list(Importance = mean)) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank(),
text = element_text(size = 10))+
ylab("Support Vector Machine Method 2")+
xlab("")
rf_imp1 <- imp9 %>%
mutate(Predictor = rownames(imp9)) %>%
pivot_longer(names_to = "NOC_10", values_to = "Importance", -Predictor) %>%
arrange(Importance) %>%
top_n(10) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank(),
text = element_text(size = 10))+
ylab("RF Method 1")+
xlab("")
rf_imp2 <- imp10 %>%
mutate(Predictor = rownames(imp10)) %>%
pivot_longer(names_to = "NOC_10", values_to = "Importance", -Predictor) %>%
arrange(Importance) %>%
top_n(10) %>%
ggplot(aes(x=Predictor, y=Importance))+
geom_segment(aes(x=Predictor, xend=Predictor, y=0, yend=Importance), color="skyblue") +
geom_point(color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank(),
text = element_text(size = 10))+
ylab("RF Method 2")+
xlab("")
grid.arrange(lda_imp1,lda_imp2,cart_imp1,cart_imp2,rf_imp1,rf_imp2,nrow=3)
This result aligns with our expectations. As noted in the Exploratory Data Analysis section, many occupations, such as those in health, are directly tied to the industry of employment. And as shown in Figure 8, hourly earnings vary across occupations, providing separability that aids the classification process.
The IMMIG variable ranks low in every variable importance measure, so we can try to refine the model by removing this insignificant predictor. We will use the SVM algorithm with method 2, since it was the most accurate model, and test whether accuracy improves.
set.seed(5)
lfs_sample_noImmi <- lfs_sample2 %>% select(-c('IMMIG'))
fit.svm_new <- train(NOC_10~., data=lfs_sample_noImmi, method='svmRadial', metric=metric, trControl=control)
predictions_svm_new <- predict(fit.svm_new, validation_2)
cm_svm_new <- confusionMatrix(predictions_svm_new, as.factor(validation_2$NOC_10))
# Comparing accuracy
Original <- fit.svm_2$results$Accuracy
Refined <- fit.svm_new$results$Accuracy
Original_Accuracy <- mean(Original)
Refined_Accuracy <- mean(Refined)
svm_compare <- data.frame(Original_Accuracy, Refined_Accuracy)
svm_compare
As seen in Table 13, the refined model's mean accuracy did not improve as expected for this particular model. Increasing the predictive model's classification performance and accuracy will require further research and data exploration.
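Comparing mean accuracies alone can mask fold-to-fold variation. Because both SVM fits were cross-validated, a paired comparison over per-fold accuracies is more informative; a base-R sketch with hypothetical fold accuracies (not the actual model results) illustrates this:

```r
# Hypothetical per-fold accuracies for the original and refined SVM,
# evaluated on the same 10 CV folds (so a paired test is appropriate)
orig    <- c(0.61, 0.63, 0.60, 0.64, 0.62, 0.59, 0.63, 0.61, 0.62, 0.60)
refined <- c(0.60, 0.62, 0.61, 0.63, 0.61, 0.59, 0.62, 0.60, 0.61, 0.60)

mean(orig)     # 0.615
mean(refined)  # 0.609

# Paired t-test on the fold-wise differences; a small p-value would indicate
# the accuracy gap is consistent across folds rather than noise
t.test(orig, refined, paired = TRUE)$p.value
```

caret offers the same comparison through resamples() followed by diff(), which also applies a Bonferroni adjustment when more than two models are compared.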
Our analysis aimed to predict occupation type (NOC_10) using the variables available in the LFS dataset. To do this effectively, we needed to carefully examine the relationships between NOC_10 and the other variables and determine which ones were most important for our predictive modeling. We also had to consider which variables provided useful information and which might be considered "noise" that could affect the accuracy of our predictions.
As a result of our investigation, we have come to the view that it is not possible to confidently determine a person's occupation type based on the available predictors. This is likely due to the complex and varied nature of modern occupations, which cannot be accurately captured by a single algorithm or model. Our findings underscore the need for additional study in this field to advance our understanding of the variables influencing occupation type and enable more precise predictions in the future.
The Labour Force Survey dataset from October 2022 was downloaded as a ‘.zip’ package from the Statistics Canada website https://www150.statcan.gc.ca/n1/pub/71m0001x/71m0001x2021001-eng.htm on November 23, 2022. Use of the product is governed by the Statistics Canada Open Licence Agreement.
This report is produced as a portion of the requirements of the Geospatial Data Analytics Graduate Certificate program at the Centre of Geographic Sciences, NSCC, Lawrencetown, NS.